{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NVIDIA NIMs\n",
    "\n",
    "The `llama-index-llms-nvidia` package contains LlamaIndex integrations building applications with models on \n",
    "NVIDIA NIM inference microservice. NIM supports models across domains like chat, embedding, and re-ranking models \n",
    "from the community as well as NVIDIA. These models are optimized by NVIDIA to deliver the best performance on NVIDIA \n",
    "accelerated infrastructure and deployed as a NIM, an easy-to-use, prebuilt containers that deploy anywhere using a single \n",
    "command on NVIDIA accelerated infrastructure.\n",
    "\n",
    "NVIDIA hosted deployments of NIMs are available to test on the [NVIDIA API catalog](https://build.nvidia.com/). After testing, \n",
    "NIMs can be exported from NVIDIA’s API catalog using the NVIDIA AI Enterprise license and run on-premises or in the cloud, \n",
    "giving enterprises ownership and full control of their IP and AI application.\n",
    "\n",
    "NIMs are packaged as container images on a per model basis and are distributed as NGC container images through the NVIDIA NGC Catalog. \n",
    "At their core, NIMs provide easy, consistent, and familiar APIs for running inference on an AI model.\n",
    "\n",
    "# NVIDIA's LLM connector\n",
    "\n",
    "This example goes over how to use LlamaIndex to interact with and develop LLM-powered systems using the publicly-accessible AI Foundation endpoints.\n",
    "\n",
    "With this connector, you'll be able to connect to and generate from compatible models available as hosted [NVIDIA NIMs](https://ai.nvidia.com), such as:\n",
    "\n",
    "- Google's [gemma-7b](https://build.nvidia.com/google/gemma-7b)\n",
    "- Mistal AI's [mistral-7b-instruct-v0.2](https://build.nvidia.com/mistralai/mistral-7b-instruct-v2)\n",
    "- And more!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install --upgrade --quiet llama-index-llms-nvidia llama-index-embeddings-nvidia llama-index-readers-file"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "**To get started:**\n",
    "\n",
    "1. Create a free account with [NVIDIA](https://build.nvidia.com/), which hosts NVIDIA AI Foundation models.\n",
    "\n",
    "2. Click on your model of choice.\n",
    "\n",
    "3. Under Input select the Python tab, and click `Get API Key`. Then click `Generate Key`.\n",
    "\n",
    "4. Copy and save the generated key as NVIDIA_API_KEY. From there, you should have access to the endpoints."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import getpass\n",
    "import os\n",
    "\n",
    "# del os.environ['NVIDIA_API_KEY']  ## delete key and reset\n",
    "if os.environ.get(\"NVIDIA_API_KEY\", \"\").startswith(\"nvapi-\"):\n",
    "    print(\"Valid NVIDIA_API_KEY already in environment. Delete to reset\")\n",
    "else:\n",
    "    nvapi_key = getpass.getpass(\"NVAPI Key (starts with nvapi-): \")\n",
    "    assert nvapi_key.startswith(\n",
    "        \"nvapi-\"\n",
    "    ), f\"{nvapi_key[:5]}... is not a valid key\"\n",
    "    os.environ[\"NVIDIA_API_KEY\"] = nvapi_key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# llama-parse is async-first, running the async code in a notebook requires the use of nest_asyncio\n",
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Working with NVIDIA API Catalog"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.llms.nvidia import NVIDIA\n",
    "from llama_index.core.llms import ChatMessage, MessageRole\n",
    "\n",
    "llm = NVIDIA()\n",
    "\n",
    "messages = [\n",
    "    ChatMessage(\n",
    "        role=MessageRole.SYSTEM, content=(\"You are a helpful assistant.\")\n",
    "    ),\n",
    "    ChatMessage(\n",
    "        role=MessageRole.USER,\n",
    "        content=(\"What are the most popular house pets in North America?\"),\n",
    "    ),\n",
    "]\n",
    "\n",
    "llm.chat(messages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Working with NVIDIA NIMs\n",
    "\n",
    "In addition to connecting to hosted [NVIDIA NIMs](https://ai.nvidia.com), this connector can be used to connect to local microservice instances. This helps you take your applications local when necessary.\n",
    "\n",
    "For instructions on how to setup local microservice instances, see https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.llms.nvidia import NVIDIA\n",
    "\n",
    "# connect to an chat NIM running at localhost:8080, spcecifying a specific model\n",
    "llm = NVIDIA(\n",
    "    base_url=\"http://localhost:8080/v1\", model=\"meta/llama3-8b-instruct\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading a specific model\n",
    "Now we can load our `NVIDIA` LLM by passing in the model name, as found in the docs - located [here](https://docs.api.nvidia.com/nim/reference/)\n",
    "\n",
    "> NOTE: The default model is `meta/llama3-8b-instruct`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# default model\n",
    "llm = NVIDIA()\n",
    "llm.model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can observe which model our `llm` object is currently associated with the `.model` attribute."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = NVIDIA(model=\"mistralai/mistral-7b-instruct-v0.2\")\n",
    "llm.model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Functionality\n",
    "\n",
    "Now we can explore the different ways you can use the connector within the LlamaIndex ecosystem!\n",
    "\n",
    "Before we begin, lets set up a list of `ChatMessage` objects - which is the expected input for some of the methods."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll follow the same basic pattern for each example: \n",
    "\n",
    "1. We'll point our `NVIDIA` LLM to our desired model\n",
    "2. We'll examine how to use the endpoint to achieve the desired task!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Complete: `.complete()`\n",
    "\n",
    "We can use `.complete()`/`.acomplete()` (which takes a string) to prompt a response from the selected model.\n",
    "\n",
    "Let's use our default model for this task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "completion_llm = NVIDIA()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can verify this is the expected default by checking the `.model` attribute."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "completion_llm.model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's call `.complete()` on our model with a string, in this case `\"Hello!\"`, and observe the response."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "completion_llm.complete(\"Hello!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As is expected by LlamaIndex - we get a `CompletionResponse` in response."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Async Complete: `.acomplete()`\n",
    "\n",
    "There is also an async implementation which can be leveraged in the same way!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "await completion_llm.acomplete(\"Hello!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Chat: `.chat()`\n",
    "\n",
    "Now we can try the same thing using the `.chat()` method. This method expects a list of chat messages - so we'll use the one we created above.\n",
    "\n",
    "We'll use the `mistralai/mixtral-8x7b-instruct-v0.1` model for the example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chat_llm = NVIDIA(model=\"mistralai/mixtral-8x7b-instruct-v0.1\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All we need to do now is call `.chat()` on our list of `ChatMessages` and observe our response.\n",
    "\n",
    "You'll also notice that we can pass in a few additional key-word arguments that can influence the generation - in this case, we've used the `seed` parameter to influence our generation and the `stop` parameter to indicate we want the model to stop generating once it reaches a certain token!\n",
    "\n",
    "> NOTE: You can find information about what additional kwargs are supported by the model's endpoint by referencing the API documentation for the selected model. Mixtral's is located [here](https://docs.api.nvidia.com/nim/reference/mistralai-mixtral-8x7b-instruct-infer) as an example!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chat_llm.chat(messages, seed=4, stop=[\"cat\", \"cats\", \"Cat\", \"Cats\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, we receive a `ChatResponse` in response."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Async Chat: (`achat`)\n",
    "\n",
    "We also have an async implementation of the `.chat()` method which can be called in the following way."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "await chat_llm.achat(messages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Stream: `.stream_chat()`\n",
    "\n",
    "We can also use the models found on `build.nvidia.com` for streaming use-cases!\n",
    "\n",
    "Let's select another model and observe this behaviour. We'll use Google's `gemma-7b` model for this task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stream_llm = NVIDIA(model=\"google/gemma-7b\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's call our model with `.stream_chat()`, which again expects a list of `ChatMessage` objects, and capture the response."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "streamed_response = stream_llm.stream_chat(messages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "streamed_response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, the response is a generator with the streamed response. \n",
    "\n",
    "Let's take a look at the final response once the generation is complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "last_element = None\n",
    "for last_element in streamed_response:\n",
    "    pass\n",
    "\n",
    "print(last_element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Async Stream: `.astream_chat()`\n",
    "\n",
    "We have the equivalent async method for streaming as well, which can be used in a similar way to the sync implementation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "streamed_response = await stream_llm.astream_chat(messages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "streamed_response"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "last_element = None\n",
    "async for last_element in streamed_response:\n",
    "    pass\n",
    "\n",
    "print(last_element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Streaming Query Engine Responses\n",
    "\n",
    "Let's look at a slightly more involved example using a query engine!\n",
    "\n",
    "We'll start by loading some data (we'll be using the [Hitchhiker's Guide to the Galaxy](https://web.eecs.utk.edu/~hqi/deeplearning/project/hhgttg.txt))."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading Data\n",
    "\n",
    "Let's first create a directory where our data can live."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p 'data/hhgttg'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll download our data from the above source."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget 'https://web.eecs.utk.edu/~hqi/deeplearning/project/hhgttg.txt' -O 'data/hhgttg/hhgttg.txt'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll need to have an embedding model for this step! We'll use NVIDIA `NV-Embed-QA` model to achieve this, and save it in our `Settings`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.embeddings.nvidia import NVIDIAEmbedding\n",
    "from llama_index.core import Settings\n",
    "\n",
    "embedder = NVIDIAEmbedding(model=\"NV-Embed-QA\", truncate=\"END\")\n",
    "Settings.embed_model = embedder"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can load our document and create an index leveraging the above"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n",
    "\n",
    "documents = SimpleDirectoryReader(\"data/hhgttg\").load_data()\n",
    "index = VectorStoreIndex.from_documents(documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can create a simple query engine and set our `streaming` parameter to `True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "streaming_qe = index.as_query_engine(streaming=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's send a query to our query engine, and then stream the response."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "streaming_response = streaming_qe.query(\n",
    "    \"What is the significance of the number 42?\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "streaming_response.print_response_stream()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tool calling\n",
    "Starting in v0.2.1, NVIDIA supports tool calling.\n",
    "\n",
    "NVIDIA provides integration with the variety of models on build.nvidia.com as well as local NIMs. Not all these models are trained for tool calling. Be sure to select a model that does have tool calling for your experimention and applications.\n",
    "\n",
    "You can get a list of models that are known to support tool calling with,\n",
    "\n",
    "`NOTE:` For more examples refer : [nvidia_agent.ipynb](../agent/nvidia_agent.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tool_models = [\n",
    "    model\n",
    "    for model in NVIDIA().available_models\n",
    "    if model.is_function_calling_model\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With a tool capable model,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.tools import FunctionTool\n",
    "\n",
    "\n",
    "def multiply(a: int, b: int) -> int:\n",
    "    \"\"\"Multiple two integers and returns the result integer\"\"\"\n",
    "    return a * b\n",
    "\n",
    "\n",
    "multiply_tool = FunctionTool.from_defaults(fn=multiply)\n",
    "\n",
    "\n",
    "def add(a: int, b: int) -> int:\n",
    "    \"\"\"Add two integers and returns the result integer\"\"\"\n",
    "    return a + b\n",
    "\n",
    "\n",
    "add_tool = FunctionTool.from_defaults(fn=add)\n",
    "\n",
    "llm = NVIDIA(\"meta/llama-3.1-70b-instruct\")\n",
    "from llama_index.core.agent import FunctionAgent\n",
    "\n",
    "agent_worker = FunctionAgent(\n",
    "    tools=[multiply_tool, add_tool],\n",
    "    llm=llm,\n",
    ")\n",
    "\n",
    "response = await agent.run(\"What is (121 * 3) + 42?\")\n",
    "print(str(response))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "nvidia-llama-index-playground-connector",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
